Estimation and inference on high-dimensional individualized treatment rule in observational data using split-and-pooled de-correlated score

With the increasing adoption of electronic health records, there is an increasing interest in developing individualized treatment rules, which recommend treatments according to patients’ characteristics, from large observational data. However, there is a lack of valid inference procedures for such rules developed from this type of data in the presence of high-dimensional covariates. In this work, we develop a penalized doubly robust method to estimate the optimal individualized treatment rule from high-dimensional data. We propose a split-and-pooled de-correlated score to construct hypothesis tests and confidence intervals. Our proposal adopts the data splitting to conquer the slow convergence rate of nuisance parameter estimations, such as non-parametric methods for outcome regression or propensity models. We establish the limiting distributions of the split-and-pooled de-correlated score test and the corresponding one-step estimator in high-dimensional setting. Simulation and real data analysis are conducted to demonstrate the superiority of the proposed method.


INTRODUCTION
An individualized treatment rule (ITR) is a decision rule that maps the patient profiles X ∈ R p into the intervention space A ∈ A, where p is the number of the covariates and A is the set of available interventions.Given an outcome of interest, the optimal ITR is the ITR maximizing the value function which is the mean outcome if it were applied to a target population.Understanding the driving factors of a data-driven ITR can help with identifying the source of the heterogeneous effects and with guiding practical applications of precision medicine.
The increasing adoption of electronic health records (EHRs) at hospitals and healthcare centers has provided us unprecedented opportunities to identify and understand the optimal ITR through massive observational data.One of the difficulties in dealing with observational data is the high dimensionality of the covariates.There have been various methods developed to estimate the optimal ITR, and some of them can be applied with the presence of high-dimensional covariates.For regression-based approaches, Q-learning methods (Watkins & Dayan, 1992;Chakraborty et al., 2010;Qian & Murphy, 2011;Laber et al., 2014a) pose a fully specified model assumption on the conditional mean of the outcomes given the covariates and treatments.Qian & Murphy (2011) consider a rich linear model to approximate the conditional mean, along with an l 1 penalty to obtain the estimated rule from high-dimensional data.A-learning methods (Murphy, 2003;Lu et al., 2013;Shi et al., 2016Shi et al., , 2018) ) pose a model assumption on the contrast function of the conditional means.With high-dimensional covariates, Shi et al. (2016Shi et al. ( , 2018) ) adopt penalized estimating equation or penalized regression with a linear contrast function.An alternative class of methods is to optimize the estimator of the mean outcome as a function of the ITR over a prespecified class of ITRs for the best one, usually called direct (Laber et al., 2014b), policy learning (Athey & Wager, 2017) or value-search (Davidian et al., 2014) estimators.Among these methods, Zhao et al. (2012) propose the outcome weighted learning approach, which constructs the optimal ITR based on an inverse probability weighted estimator of the value, with the involved indicator replaced by a convex surrogate.Song et al. (2015) develop a variable selection method based on penalized outcome weighted learning for optimal individualized treatment selection.
Statistical inference for the optimal ITR is particularly challenging in the presence of highdimensional covariates.Estimating ITRs from large observational data such as EHR data are susceptible to confounding and selection bias, which add one more layer of complexity.Liang et al. (2018) propose a concordance-assisted learning algorithm to estimate the optimal ITR in the presence of high-dimensional covariates.Nonetheless, they do not provide any inference procedures.Inference methods for A-learning approaches such as Song et al. (2017) and Jeng et al. (2018) are developed assuming the propensity score is known.Thus, their methods cannot be applied if data are collected from observational studies.Shi et al. (2018) derive the oracle inequalities of the proposed estimators for the parameters in a linear contrast function, but their work focuses on the selection consistency and has little discussion on the inference of the optimal ITR.Furthermore, their method is not robust to the misspecification of the logistic propensity score model.In practice, to avoid misspecification, flexible models may be adopted for the outcome regression or the propensity score.However, these models result in slow convergence rates for the nuisance parameters, and deteriorate the limiting distribution of the estimated ITRs.As such, it is important to propose an inference procedure for the optimal ITR, which is valid under the highdimensional setup and robust to flexible models for the nuisance parameters.Recent literature on the high-dimensional inference can assist with tackling this challenge.For example, van de Geer et al. (2014) propose a debiased Lasso approach for generalized linear models.Ning & Liu (2017) propose a de-correlated score test for low dimensional parameters with the existence of the high-dimensional covariates, which is applicable for parametric models with correctly specified likelihoods.Dezeure et al. (2017) propose a bootstrap procedure for high-dimensional inference, but it is computationally intensive.
In this work, we propose a novel penalized doubly robust approach, termed as penalized efficient augmentation and relaxation learning (PEARL), to estimate the optimal ITR in observational studies with high-dimensional covariates.We construct the ITRs by optimizing a convex relaxation of the augmented inverse probability weighted estimator of the value with penalties, which generalizes the method proposed in Zhao et al. (2019) to high-dimensional setup.The proposed procedure involves estimation of the conditional means of the outcomes and the propensity scores as nuisance parameters.As long as one of the nuisance models is correctly specified, we can consistently estimate the optimal ITRs under certain conditions.Furthermore, we propose a split-and-pooled de-correlated score test, which provides valid hypothesis testing and interval estimation procedures to identify the driving factors of the optimal ITR.The proposed procedure generalizes the de-correlated score (Ning & Liu, 2017) to handle the potential slow convergence rates from the nuisance parameters estimation and to allow a general loss function.Samplesplitting is adopted to separate the estimation of the nuisance parameters from the construction of the de-correlated score, which is utilized in Chernozhukov et al. (2018) for inference on a low-dimensional parameter of interest in the presence of high-dimensional nuisance parameters.However, the inference on optimal ITRs using the proposed PEARL approach requires a more sophisticated analysis due to the convex relaxation schemes.Theoretically, we show that the split-and-pooled de-correlated score is asymptotically normal even when the nuisance parameters are estimated non-parametrically with slow convergence rates.We also show that a sample-splitting procedure is not required to derive the limiting distribution of the split-andpooled de-correlated score, if the nuisance parameters are assumed to follow parametric models.In addition, the proposed method applies to the high-dimensional inference based on general loss functions with nuisance parameters.
This paper is organized as follows.In Section 2, we introduce our estimation and inference procedures for the optimal ITR for high-dimensional data.In Section 3, we provide the theoretical properties of the proposed procedure.We conduct simulation studies in Section 4. In Section 5, the method is applied to a dataset that contains linked claims and EHR data for Medicare beneficiaries on complex diabetes patients.We provide a discussion in Section 6.

METHOD
In this section, we first present the penalized efficient augmentation and relaxation learning (PEARL) approach for the ITR estimation.We then propose an inference procedure in the presence of high-dimensional covariates, which provides results for hypothesis tests and confidence intervals.

Penalized efficient augmentation and relaxation learning
Let X be a random vector of dimension p × 1, which contains the baseline or pre-treatment covariates capturing patient profiles.We assume that p can be much larger than the sample size n.Let A ∈ {−1, 1} be the treatment assignment, and Y ∈ R be the observed outcome that higher values are preferred.Here, we adopt the framework of potential outcomes (Rubin, 1974(Rubin, , 2005)).Denote the potential outcome under treatment a ∈ {−1, 1} as Y (a).Then the observed outcome is Y = Y (a)I{a = A}, where I{•} is the indicator function.An ITR, denoted by D, is a mapping from the space of covariates X ⊆ R p to the space of treatments A = {−1, 1}.With a slight abuse of notation, we write the observed outcome under this ITR as , is called the value function which is the average of the outcomes over the population if the ITR were to be adopted.In order to express the value in terms of the data generative model, we assume the following conditions: 1) the Stable Unit Treatment Value Assumption (SUTVA) (Imbens & Rubin, 2015); 2) the strong ignorability SUTVA condition assumes that the potential outcomes for a patient do not vary with the treatments assigned to other patients.It also implies that there are no different versions of the treatment.The strong ignorability condition means that there is no unmeasured confounding between the potential outcomes and the treatment assignment mechanism.The optimal ITR, D opt (X) = arg max D {V (D)}, is the ITR that leads to the largest value function.
In this paper, due to the high-dimensional nature of the data we work with, we focus on deriving a linear decision rule of the form D(x) = sgn(x ⊤ β), where x ∈ X .In general, D opt (x) could be a complex function of x, but in many situations, the optimal rule D opt (x) may only depend on a linear function of x (Xu et al., 2015).We assume that D opt (x) = sgn(x ⊤ β opt ), which also indicates Under the conditions above, the augmented inverse probability weighted estimator of the value function is where E n denotes the empirical average, and π(a; X) and Q(a; X) are the estimators of π(a; X) and Q(a; X) respectively.Assume that pr{D(X) = 0} = 0. Define Assume that Q(a; x) and π(a; x) converge in probability uniformly to some deterministic limits, denoted by Q m (a; x) and π m (a; x), respectively.V (D) converges to V m (D), where Here, W m a = W a (Y, X, A, π m , Q m ) is the limit that Ŵa converges to, a = ±1, and W m a,+ (W m a,− ) are W m a 's postive (negative) part.As shown in Zhao et al. (2019), if either π m (a; Directly optimizing (1) is infeasible due to the indicator function in the objective function, especially with a large number of covariates.We propose a PEARL estimator of the optimal ITR.In particular, we consider a relaxation of (1), where we replace the indicator function with a convex surrogate loss.Furthermore, we add a sparse penalty function, which enables us to eliminate the unimportant variables from the derived rule.Denote Ω+ = Ŵ1,+ + Ŵ−1,− and where φ is a convex surrogate loss, P (β) is a sparse penalty function with respect to β, and λ n is a tuning parameter controlling the amount of penalization.In this paper, we focus on the L 1 lasso penalty P (β) = β 1 .The framework allows a broad class of surrogate loss functions, such as logistic loss, φ(t) = log 1 + e −t , see Section 3 for the detailed technical conditions on φ.The estimated ITR can be subsequently obtained as D(X) = sgn(X ⊤ β).

Split-and-pooled de-correlated score test
We define and To simplify notations, we will suppress the superscript and write them as Without loss of generality, suppose that β * 1 is of interest.The statistical inferential problem can be formulated as testing the null hypothesis The proposed method can be easily generalized to the setting where β * 1 is multi-dimensional.Lemma 1 provides sufficient conditions that β * satisfies D opt (X) = sgn(X ⊤ β opt ) = sgn(X ⊤ β * ), which indicates the inference on β * is equivalent to that on D opt .LEMMA 1.If the optimal ITR D opt has a linear form, and The condition (a) excludes the situation where β opt = 0.When all outcomes are positive, The condition (c) on the design matrix X is common in the dimension reduction literature (Ma & Zhu, 2012, 2013), and known to be satisfied if the distribution of X is elliptically symmetric.
Remark 1.While Lemma 1 establishes a certain relationship between the optimal ITR and the ITR under the surrogate loss, the interval estimation is less interpretable because any positive scaling of β opt still satisfies the conditions in Lemma 1.However, a hypothesis testing on β opt is meaningful in the sense that when β opt 1 , the first coordinate of β opt , is 0 (or non-zero), any non-zero scaling of it is also 0 (or non-zero).This is the main reason for us to generalize the decorrelated score test rather than the debiased Lasso approach (van de Geer et al., 2014), which mainly focuses on construction of confidence intervals.
Alternatively, instead of assuming the conditions in Lemma 1, the desired relationship D opt = sgn(X ⊤ β * ) may still hold under some parametric assumptions on E For example, if the following conditions are satisfied for some measurable function h(X), we still have D opt = sgn(X ⊤ β * ) .Condition (3) poses a parametric assumption on E[Ω + |X]/E[Ω − |X] (see supplementary material for the details).This ratio measures the relative change of the potential outcomes.Under Condition (3), hypothesis testing of β * is equivalent to testing for the optimal ITR D opt .Furthermore, the interval estimation of β * can be interpreted through the specified model assumption in (3).
Suppose that Ω + and Ω − are known, then the estimator β is obtained by minimizing the empirical loss , where β−1 is a p − 1 dimensional sub-vector of β without β1 .In the low dimensional setting where p is fixed, the score function with βnull , E n ∇l φ ( βnull ; Ω + , Ω − )X 1 , is asymptotically normal.Nevertheless, in a high-dimensional setting, the asymptotic normality of the score function E n ∇l φ ( βnull ; Ω + , Ω − )X 1 is deteriorated by the high dimensionality of β−1 .Following Ning & Liu (2017), we utilize the semiparametric theory to de-couple the estimation error of β−1 with the score function of β 1 .A de-correlated score function is defined as where w * = I * −1,−1 −1 I * −1,1 is chosen to reduce the uncertainty of the score function due to the estimation error of β−1 , and I * −1,−1 and I * −1,1 are the corresponding partitions of Under the null hypothesis, this de-correlated score function follows where We propose to estimate the nuisance parameter w * via where λn is a tuning parameter.A valid test for The nuisance parameters, Ω + and Ω − are unknown in practice, and are estimated via modeling π and Q.To avoid misspecification, they can be estimated using flexible nonparametric or machine learning methods, which may lead to convergence rates slower than n −1/2 .To overcome the possible slow convergence rates of π and Q, we propose a split-and-pooled de-correlated score, where we consider a sample split procedure in constructing the de-correlated score function (Chernozhukov et al., 2018).
Let I 1 , . . ., I K be a random partition of the observed data with approximately equal sizes, where K ≥ 2 is a pre-specified positive integer.We assume that ⌊n/K⌋ ≤ |I k | ≤ ⌊n/K⌋ + 1, for all k = 1, . . ., K. Let E (k) n [•] denote the expectation defined by the data in I K .For each k ∈ {1, . . ., K}, we repeat the following procedure.First, we obtain π(−k) and Q(−k) using the data excluding I k .In the presence of high-dimensional covariates, we can use generalized linear model with penalties (van de Geer, 2008) or kernel regression after a model-free variable screening (Li et al., 2012;Cui et al., 2015) for estimating π and Q.A data-split PEARL estimator β(k) is obtained by where Ω(−k) where λn,k is a tuning parameter.Let , where Finally, we construct the data-split de-correlated score test statistic S (k) ( β(k) null , ŵ(k) ) as Combining K data-split PEARL estimators, we can obtain the pooled PEARL estimator as Likewise, the pooled de-correlated score test statistic is The data-split procedure de-correlates the estimation errors of π and Q with the estimation of β(k) and the construction of the de-correlated score.Therefore, flexible nonparametric or machine learning methods can be used to estimate π and Q.As shown in Theorem 2, under null hypothesis, we have The detailed algorithm is provided in Algorithm 1.In this algorithm, for a fixed 1 ≤ k ≤ K, π(−k) and Q(−k) are trained on a subset of samples of size n(K − 1)/K.The sample splitting is the key to allow for flexible nonparametric or machine learning estimates, which extends the scope of the original de-correlated score approach (Chernozhukov et al., 2018).However, when both π and Q are estimated parametrically, quantification of the estimation errors of π and Q is tractable by using Taylor expansion (Chernozhukov et al., 2018).In this case, we can directly use the whole dataset to estimate the nuisance parameters.This algorithm is summarized in Algorithm 2. Algorithm 1. Inference of β * using a sample-split procedure , where , and the estimator of the variance σ2 Calculate the p-value by 2 (1 − Φ(|S|/σ)), where Φ(•) is the cumulative distribution function of a standard normal distribution.
Algorithm 2. Inference of β * with parametric propensity and outcome model estimations Input: n samples.
Output: β and a p-value for H 0 : β * 1 = 0. Use all data to fit a parametric regression model with a lasso penalty and obtain an estimator π0 for the propensity and an estimator Q0 for the outcome model; Obtain the PEARL estimator β by min β E n l φ β; Ω+ , Ω− + λ n β 1 , where Ω+ and Ω− are computed with π0 and Q0 plugged in, and λ n is tuned by cross-validation; Obtain an estimator ŵ for w * by , and the estimator of the Calculate the p-value by 2 (1 − Φ(|S|/σ)), where Φ(•) is the cumulative distribution function of a standard normal distribution.

Confidence intervals
In this section, we use the data-split de-correlated score to construct a valid confidence interval of β * .This is motivated from the fact that the data-split de-correlated score S (k) β, ŵ(k) is also an unbiased estimating equation for β * 1 when fixing β −1 = β * −1 .However, directly solving this estimating equation has several drawbacks, such as the existence of multiple roots or ill-posed Hessian (Chapter 5 in van der Vaart (2000)).Ning & Liu (2017) proposed a one-step estimator, which solved a first order approximation of the de-correlated score.Following their procedure, we construct the data-split one-step estimator, β(k) 1 , as the solution to, Hence, Finally, the pooled one-step estimator is the aggregation of these data-split one-step estimators following In Section 3, we will show the asymptotic normality of the pooled one-step estimator β1 , which provides a valid confidence interval for β * 1 .The algorithm for constructing confidence intervals is presented in Algorithm 3. Algorithm 3. Confidence interval of β * 1 using a sample-split procedure Input: The data-split de-correlated score S (k) β(k) , ŵ(k) and Î(k) 1|−1 for k = 1, . . ., K; σ2 from Algorithm 1. Output: A 95% confidence interval for β * 1 .Construct the data-split one-step estimator by

THEORETICAL PROPERTIES
In this section, we investigate the theoretical properties of the proposed procedures.We assume the following conditions.
In addition, we require that and where s * = β * 0 and s ′ = w * 0 .
Condition (C1) on the joint distribution of (X, A, Y ) is commonly assumed in high-dimensional inference literature (van de Geer et al., 2014;Ning & Liu, 2017).For technical simplicity, we assume that the design is uniformly bounded in the (C1).We also assume that Y (a) − Q(a; X) is sub-exponential or bounded.This condition enables a faster convergence rate of highdimensional empirical processes involving the estimation errors of π and Q.Under this con-dition, if sup X Q(a; X) − Q(a; X) = o p (1), we have Condition (C2) prevents the extreme values in the true propensities.Condition (C3) guarantees the Fisher consistency (Bartlett et al., 2006).Condition (C4) and ( C5) are technical conditions for the loss function (Ning & Liu, 2017;van de Geer et al., 2014).In particular, condition (C4) requires that the population Hessian of the loss function l φ is not ill-posed, and this negates any loss functions with a trivial second derivative such as the hinge loss.Condition (C5) characterizes the nonlinearity of the surrogate loss.It assumes that φ ′′ is positive and bounded, which is satisfied for a strictly convex loss on a compact set.The assumption is that |φ ′′ (t 1 ) − φ ′′ (t)| ≤ C|t 1 − t|φ ′′ (t) is related to the so-called self-concordance property (Bach, 2010), which can be satisfied by a broad class of loss functions, for example, logistic loss.
Condition (C6) is imposed for Algorithm 1.We assume that it holds on each split dataset.To simplify the notation, we do not distinguish π and Q with π(−k) and Q(−k) for a fixed k.First it requires that both π and Q are consistent and the convergence rates satisfy n −α−β ≪ n −1/2 .This can be attained if either the convergence rate of π or Q is sufficiently fast.For example, if π is estimated by a regression spline estimator and is known to be p π -dimensional (low dimension) by design, we have sup X |π(a; X) − π(a; X)| = O p n −1/3 , where π is assumed to belong to the Hölder class with a smoothness parameter greater than 5p π (Newey, 1997).Then n −α−β ≪ n −1/2 is satisfied when n −β ≪ n −1/6 .Second, (5) in Condition (C6) requires that the number of nonzero entries of β * and w * is smaller than the order of n 1/2 / log p, which agrees with the conditions in the high-dimensional inference literature (van de Geer et al., 2014;Ning & Liu, 2017).Finally, (6) of Condition (C6) indicates the convergence rates of the nuisance parameter estimations cannot be too slow if s * increases fast with the sample size n.
Condition (C6') provided below is parallel to Condition (C6) for Algorithm 2, where we estimate both nuisance parameters parametrically using the entire sample.
(C6') Suppose that π(a; X) and Q(a; X) are known to follow parametric models π(a; X, β π ) and Q(a; X, β Q ) with true parameters β * π and β * Q respectively.Assume π(a; X, β π ) and Q(a; X, β Q ) are second order continuously differentiable with respect to β π and β Q , and are bounded for a = 1 and −1.
Further, there exist constants , where for two matrices A and B, A ≺ B implies that where s * = β * 0 and s ′ = w * 0 .
It can be verified that penalized generalized linear models satisfy Condition (C6') under certain conditions.For example, if π is estimated using a logistic regression with a lasso penalty and Q is estimated using a linear regression with a lasso penalty, we have βπ In addition to the requirement that max {s * , s ′ } log p/n 1/2 → 0, Condition (C6') also requires that THEOREM 1. Assume that Conditions (C1)-(C5) and Condition (C6) hold.By choosing λ n,k ≍ (log p/n) 1/2 , we have Theorem 1 assumes that both the outcome and propensity score models are correctly specified, Q m = Q and π m = π (implied by Condition (C6)).Nonetheless, our PEARL estimator enjoys the doubly robustness property in the sense that This also indicates that as long as one of the estimators π and Q has a fast rate, the convergence rate of β established in Theorem 1 is preserved.Theorems 2 and 3 provide the limiting distributions of the testing procedures in Algorithm 1 and the pooled one-step estimator β1 in Algorithm 3 via sample-splitting, respectively.

THEOREM 2. Assume that Conditions (C1)-(C6) hold. For Algorithm 1, under the null hypothesis H
and σ2 → σ 2 , where σ2 is given in Algorithm 1, and THEOREM 3. Assume that Conditions (C1)-(C6) hold.The pooled one-step estimator satisfies where Remark 2. Theorems 2 and 3 assume that both the propensity and the outcome models are correctly specified and estimated.Nonetheless, when the propensity score is known by the design of the experiment, the conclusions in Theorems 2 and 3 still hold even if the outcome model is misspecified.In contrast, Q-learning requires correctly specified outcome models even when the propensity is known.In practice, ITR can still be linear even if the contrast function is non-linear.As such, our modeling framework is more flexible.The advantages of our methods extend to the high-dimensional setting.The outcome weighted learning approach does not involve modeling outcomes.However, the corresponding penalized estimator in the outcome weighted learning approach may have a slower convergence rate than the proposed estimator in Theorem 1 when the propensity score is estimated with a slow rate.Therefore, the de-correlated score or the one-step estimator based on the outcome weighted learning approach cannot achieve a limiting distribution with n 1/2 convergence rate as in Theorems 2 and 3.
Finally, under Condition (C6'), the following theorem provides theoretical results of the decorrelated score test with parametric propensity and outcome model estimations in Algorithm 2. variable screening procedure (Li et al., 2012).We then fit a kernel regression using the selected variables after screening.When estimating π, we set caps at 0.1 and 0.9 to trim extreme values.
In all scenarios, the sample size n and the dimension p range from 100, 200, 350, 500, to 800.We set the nominal significant level at 0.05, and the nominal coverage at 95%.We report the type I errors, the powers of the hypothesis tests, and the value functions under the estimated ITRs out of 1000 replications.In particular, we present the type I errors for testing β * 5 to β * 8 , and the powers for testing β * 1 to β * 4 .For each method, we also present the coverage of the interval estimations around the limiting coefficients.
Figures 1-6 show the simulation results for different scenarios, with the sample size n varied and the p and ξ fixed.Additional results on varying p with n and ξ fixed can be found in the supplementary material.As expected, in Scenarios (I) (Figure 1) where the regression model is correctly specified for Q-learning, Q-learning yields a better value function.Conversely, the proposed PEARL outperforms the Q-learning method in Scenario (II) and (III) (Figures 3 and 5, respectively).In terms of the type I error and power, the proposed method is comparable to the Q-learning approach in Scenario (I) (Figure 1).For Scenarios (II) and (III) (Figures 3 and 5, respectively), our method is more powerful, and the type I errors are well controlled.The power reduction for the Q-learning approach may be due to the model misspecification.The coverage of β * 5 to β * 8 are concentrated near 95%, and the coverage of the β * 1 to β * 4 gradually approach 95% for the proposed method.Conversely, the Q-learning approach seems to require a large sample size for a valid confidence interval in some cases.
In summary, the proposed method has a comparable performance to Q-learning when the model is correctly specified (Scenario (I)).The strength of the PEARL method is shown in Scenarios (II) and (III), when the model is misspecified in Q-learning.The proposed procedure achieves controlled type I errors and higher powers in hypothesis testing, even when the nuisance parameters are estimated in a nonparametric fashion.For all the scenarios, the interval estimations for the proposed approach can obtain the nominal coverage (95%) when the sample sizes approach n = 800.

REAL DATA ANALYSIS
In this section, we apply our proposed estimation and inference procedures to construct the optimal ITRs for complex patients with type-II diabetes.The data are collected from the electronic health records through Health Innovation Program at University of Wisconsin.The entire dataset includes n = 9101 patients.There are 40 covariates, including socio-demographic variables, previous disease experiences, and baseline HbA1c levels, etc.The outcome is the indicator whether the patient successfully controls the HbA1c below 8% after a year.The treatment A = 1 if the patient received any medications, including insulin, sulphnea or OHA, and A = −1 otherwise.Among 9101 patients, 17.1% had a missing post-treatment HbA1c measurement, and 15.4% had the missing baseline HbA1c measurements.We impute missing values using Multivariate Imputation by Chained Equations (MICE package in R), which is based on the estimated conditional distributions of each covariate given other covariates (van Buuren & Groothuis-Oudshoorn, 2011).To address the possible interactions among covariates, we consider both raw covariates and all first-order interactions.We rank these covariates by their variances and select p = 100 covariates with top variances.
We split the dataset into a training dataset (80% of the entire dataset) and a testing dataset (20% of the entire dataset).The proposed method and Q-learning are fitted on the training dataset using the same strategies as described in simulation studies.To evaluate these estimated ITRs, we calculate the value function by E n [Y 1{A = D}/π 0 ], on the testing dataset, where D is the Table 1.Results for comparisons on value functions.

Method
Mean Sd Observed 0.860 0.008 PEARL 0.877 0.015 Q-Learning 0.869 0.015 estimated ITR on the training dataset and π0 is estimated the propensity scores on the testing dataset.The entire procedure is repeated 100 times with random training and testing data splits.The mean and standard deviation (sd) of the value functions over these repeats are summarized in Table 1.Both the proposed and Q-learning methods construct ITRs that yield better results than the current clinical practice.Furthermore, our proposed method achieves a higher value function than Q-learning approach as shown in Table 1.
Next, we conduct the inference procedure to identify driving factors of the optimal ITR as well as to provide an interval estimation using the entire dataset.Results are presented in Table 2.After controlling for the false discovery rate (FDR ≤ 0.05), our results indicate that a female patient with a higher HbA1c value at baseline are more likely to benefit from the treatment.6.DISCUSSION In this work, we propose a penalized doubly robust approach to estimate the optimal ITR from high-dimensional observational data.Our approach involves estimations of the outcome model and the propensity score as nuisance parameters.The estimation methods for these nuisance parameters can be either parametric or non-parametric.Furthermore, we propose a split-andpooled de-correlated score test and an interval estimation procedure as inference tools to identify the driving factors of the optimal ITR.This inference approach generalizes the de-correlated score (Ning & Liu, 2017) and adopts sample-splitting to separate the estimation of the nuisance parameters from the construction of the de-correlated score (Chernozhukov et al., 2018).
In this paper, we consider a single stage problem and assume a high-dimensional linear decision rule.In practice, especially in managing chronic diseases, dynamic treatment regimes are widely adopted, where sequential decision rules for individual patients adapt overtime to the evolving disease.One future direction is to develop inferential methods in the multi-decision setup.We can also extend the linear decision rule to a single index decision rule d(X ⊤ β * ), where d is an unknown function.Throughout, we require that the surrogate loss function be differentiable.A non-differentiable surrogate loss such as the hinge loss does not have a welldefined Hessian, which hinders the construction of the de-correlated score.This can be addressed by a smoothed hinge loss or an approximation of the Hessian.We are currently working on these possible extensions.
π(−k) and Q(−k) plugged in, and λ n,k is a tuning parameter.Then, we estimate w * by Input: A random seed; n samples; a positive integer K. Output: β and a p-value for H 0 : β * 1 = 0. Randomly split data into K parts {I k } K k=1 with equal size, and set k = 1; Estimate π and Q on I c k and denote the estimator as π(−k) and Q(−k) ; Obtain a data-split PEARL estimator β(k) on I k by min β E π(−k) and Q(−k) plugged in, and λ n,k is tuned by validation on I c k ; Obtain an estimator ŵ(k) for w * by min w E 3, . . ., K, and repeat Step 2 and 5. Obtain β(k) β−1 is a p − 1 dimensional sub-vector of β without β1 .Construct the de-correlated score test statistic S( βnull , ŵ) as

Fig. 1 .
Fig. 1.Simulation results for Scenario (I) with the change of sample size when ξ = 0.7 and p = 800.Types of the line represent different coefficients.

Fig. 2 .
Fig. 2. Coverage of the coefficients for Scenario (I) with the change of sample size when ξ = 0.7 and p = 800.Types of the line represent different coefficients.

Fig. 3 .
Fig. 3. Simulation results for Scenario (II) with change of sample size when ξ = 0.4 and p = 800.Types of the line represent different coefficients.

Table 2 .
Coefficients and p-value for the identified significant covariate of the estimated optimal ITR.Special chronic conditions refer to chronic conditions including amputation, chronic blood loss, drug abuse, lymphoma, metastatistic cancer, and peptic ulcer disease.Bucketized age refers to a variable created by bucketizing the raw age by its observed quartiles.